Introduction to Open Data Science - Course Project

About the project

I began my PhD studies at the University of Helsinki this autumn, and I expect to use R as part of my research project in musicology, which employs a digital humanities approach. I’ve previously taken courses on statistics (including analysis in R) via the open universities at the University of Helsinki and the Hanken School of Economics. I recently got a new computer (and unfortunately lost most of my scripts from earlier courses), so I’m making a fresh start with R now.

Hitherto, I’ve most often processed my numbers using spreadsheets such as Excel and Google Sheets, and I’ve developed techniques to do my calculations quickly in these programs. The downside is that the threshold for using a spreadsheet has usually felt lower than for using R when processing numbers (and the quickest calculator is still Spotlight in macOS). So during this course, I hope to learn some useful techniques for making sense of data quickly. Perhaps RStudio might become my program of choice some day. :)

The address of my GitHub repository is http://github.com/wilhelmkvist/IODS-project.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 13 01:55:10 2021"

RStudio Exercise 2

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec 13 01:55:10 2021"

1. Learning 2014 - an overview of the dataset

The dataset ‘Learning 2014’ contains data from a survey among students in the social sciences at the University of Helsinki. The data was collected during an introductory course in statistics in late 2014 and early 2015. Participation was voluntary, but students were strongly encouraged to take part, as they were rewarded with extra points in the final exam. Respondents (N=183) were presented with a set of questions designed to reflect their attitudes towards statistics and university studies in general. Learning approaches were measured with questions designed to reflect deep, surface, and strategic approaches. The present dataset (a subset of a larger dataset) contains 166 observations (by all those who participated in the final exam) of 7 variables. In addition to mean scores reflecting attitude and learning approaches, the total points in the final exam are provided. Background information was provided on the respondents’ age and gender (female/male).

setwd("~/Documents/IODS-project")
learning2014 <- 
read.table("data/learning2014.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Below is an overview of the different variables. The course was clearly dominated by females, accounting for two of three students. Note the age span from 17 to 55 years (mean 25.5 years, median 22 years).

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

2. Exploring the dataset graphically

The chart below outlines attitude, learning approaches and final exam score (points) by gender. Males generally scored higher on attitude and the deep learning approach, while females scored higher on the strategic and surface learning approaches. Final exam scores were roughly equal across genders, with only a few males outperforming females at the very highest scores. Although males were generally slightly older, age and final score were negatively correlated for males (-0.24); for females the correlation was negligible (-0.02). The strongest correlation was between attitude and final score (0.44), about equal for males and females. There was a slight correlation between attitude and the deep learning approach (0.11), more so among males (0.17). However, there was no correlation between the deep learning approach and final scores (-0.01). Apart from attitude, a strategic learning approach was positively correlated with points (0.15), while a surface learning approach was negatively correlated with final scores (-0.14).

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
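The pairwise correlations quoted above can also be read off numerically with cor(). A minimal self-contained sketch on simulated data (the frame df and its random values are illustrative, not the course data; with learning2014 loaded, cor(learning2014[, -1]) gives the full matrix):

```r
set.seed(1)
# illustrative data frame mimicking the structure of learning2014
df <- data.frame(
  gender   = factor(rep(c("F", "M"), each = 50)),
  attitude = runif(100, 1, 5),
  points   = sample(7:33, 100, replace = TRUE)
)
# overall correlation between attitude and points
cor(df$attitude, df$points)
# the same correlation computed separately for each gender
by(df, df$gender, function(d) cor(d$attitude, d$points))
```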

3. Fitting a regression model

Based on the correlation between variables (analyzed above), I choose to fit a regression model in order to understand the impact of age, attitude and learning approach on exam points. As it turns out, the student’s attitude is highly significant (p < 0.001), while age and a strategic learning approach are moderately significant (p = 0.0981 and 0.0621 respectively).

options(scipen = 999) # display very small p-values as decimals instead of scientific notation
model1 <- lm(points ~ age + attitude + stra, data = learning2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 0.00006171726 ***
## age         -0.08822    0.05302  -1.664        0.0981 .  
## attitude     3.48077    0.56220   6.191 0.00000000472 ***
## stra         1.00371    0.53434   1.878        0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 0.0000000107

For the sake of comparison, I choose to make a second model. Excluding the moderately significant variables ‘age’ and ‘stra’ does not seem to make any big difference. (I have not conducted a formal hypothesis test between the models, but the intercept and attitude remain highly significant in both.)

model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 0.00000000195 ***
## attitude      3.5255     0.5674   6.214 0.00000000412 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 0.000000004119
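Such nested models could be compared formally with an F test via anova(); with the course data, the call would be anova(model2, model1). A minimal sketch on simulated data (x1, x2 and y are hypothetical variables, not from learning2014):

```r
set.seed(42)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 + 3 * d$x1 + rnorm(100)          # y depends on x1 only
full    <- lm(y ~ x1 + x2, data = d)
reduced <- lm(y ~ x1, data = d)
# F test of whether adding x2 significantly improves the fit
anova(reduced, full)
```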

4. Predicting exam scores

I choose to go forward with the first model (model1). According to the model, exam scores can be predicted if we know a person’s age, attitude and score on the strategic learning approach. As stated earlier, age is slightly negatively correlated with exam score, while attitude is to a much larger degree positively correlated with exam score. A strategic learning approach is moderately positively correlated with exam scores.

Let us make two example calculations, one for a moderately motivated young undergraduate (“Mike”), another for a highly motivated late bloomer (“Ritva”). Mike, age 22, scores 2 out of 5 on attitude and strategic learning. Ritva, age 55, scores 5 out of 5 on both. We can now predict their exam scores:

students <- c('Mike','Ritva')
age <- c(22, 55)
attitude <- c(2,5)
stra <- c(2,5)
new_data <- data.frame(students, age, attitude, stra)
predict(model1, newdata = new_data)
##        1        2 
## 17.92358 28.46580

Effectively, R here makes the following calculations using these coefficients:

model1$coefficients
## (Intercept)         age    attitude        stra 
## 10.89542805 -0.08821869  3.48076791  1.00371238

Calculating predictions using the coefficients can be done in the following manner. (Differences in decimals are due to rounding errors.)

#Mike:
10.89543+(-0.08822*22)+(3.48077*2)+(1.00371*2)
## [1] 17.92355
#Ritva:
10.89543+(-0.08822*55)+(3.48077*5)+(1.00371*5)
## [1] 28.46573
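The same hand calculation can be written more compactly as a matrix product of a design matrix and the coefficient vector (the numbers below are copied from model1$coefficients above):

```r
# coefficients as reported by model1$coefficients
beta <- c(10.89542805, -0.08821869, 3.48076791, 1.00371238)
# design matrix: intercept column plus age, attitude and stra for Mike and Ritva
X <- cbind(intercept = 1,
           age      = c(22, 55),
           attitude = c(2, 5),
           stra     = c(2, 5))
X %*% beta
# 17.92358 and 28.46580, matching predict() above
```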

The R-squared value (0.2182) summarizes how well the regression line fits the data. It shows that the model makes a fairly good fit, explaining about 22 per cent of the variance in exam scores. (Note that this figure describes the whole model; of its three predictors, attitude is by far the strongest contributor.)

summary(model1)$r.squared
## [1] 0.218172

5. Graphical model validation

In the following, I will produce three diagnostic plots in order to graphically explore the validity of my model assumptions. Plotting residuals vs. fitted values checks that the errors have constant variance and show no systematic pattern. The Q-Q plot shows whether the errors are approximately normally distributed. Residuals vs. leverage reveals whether any single observation has an outsized influence on the model. In this case, the errors appear to be roughly normally distributed, uncorrelated and of constant variance, and no severe outliers are identified.

par(mfrow = c(2,2))
plot(model1, which = c(1,2,5))


RStudio Exercise 3

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec 13 01:45:46 2021"

1. Exploring alcohol consumption among Portuguese secondary school students - an overview of the Alc dataset

The Alc dataset provides data on alcohol consumption among Portuguese secondary school students aged 15 to 22 (mean 16.6 years). The data was collected using school reports and questionnaires. Attributes include student grades, demographic, social and school related features. The Alc dataset was constructed by joining two separate sets on student performance in two distinct subjects: Mathematics and Portuguese. As some students were known to have appeared in both sets, duplicates were identified and removed.

The Alc dataset features a total of 370 observations of 33 variables (listed below). Note: alc_use was calculated as the mean of workday (Dalc) and weekend (Walc) alcohol consumption; high_use is a binary variable indicating whether the student drinks more than 2 doses per day on average. The data was used in a paper by Cortez and Silva (2008). More information on the dataset, along with attribute descriptions, can be found on the UCI Machine Learning Repository website.

alc <- read.table("data/alc.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

There are a number of variables that can be studied to understand student alcohol consumption. To get a quick overview of the content and distribution of each variable we can use the following code:

fig.width=8
fig.height=6
library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
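As a side note, gather() has since been superseded in tidyr by pivot_longer(); an equivalent sketch, shown here on a small hypothetical frame df (with the course data, df would be replaced by alc):

```r
library(tidyr); library(ggplot2)
set.seed(1)
df <- data.frame(a = rnorm(20), b = sample(1:5, 20, replace = TRUE))
# reshape all columns into key/value pairs, then facet by variable
long <- pivot_longer(df, everything(), names_to = "key", values_to = "value")
ggplot(long, aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
```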

In order to find out which variables could possibly be statistically relevant, I begin by running a regression model explaining “alc_use” with all other variables (except for “Dalc”, “Walc” and “high_use” which are directly related to “alc_use”).

# exclude variables "Dalc","Walc","high_use" which are directly related to "alc_use"
selvarnames <- names(alc) %in% c("Dalc", "Walc", "high_use")
alc2 <- alc[!selvarnames]
# create a formula where "alc_use" is explained by the remaining variables (stored in alc2)
fmla <- as.formula(paste("alc_use ~", paste(names(alc2)[1:31], collapse = "+")))
# create linear regression model and read the summary
model1 <- lm(data=alc2, formula = fmla)
summary(model1)
## 
## Call:
## lm(formula = fmla, data = alc2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65003 -0.49853 -0.08492  0.38197  2.87986 
## 
## Coefficients:
##                   Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)      -0.308078   0.928209  -0.332        0.740172    
## schoolMS         -0.206366   0.166819  -1.237        0.216943    
## sexM              0.453517   0.098064   4.625 0.0000053993241 ***
## age               0.067109   0.043688   1.536        0.125478    
## addressU         -0.232701   0.119806  -1.942        0.052951 .  
## famsizeLE3        0.159470   0.099574   1.602        0.110221    
## PstatusT          0.016944   0.148389   0.114        0.909161    
## Medu              0.021611   0.065772   0.329        0.742688    
## Fedu              0.055074   0.056854   0.969        0.333409    
## Mjobhealth       -0.252106   0.227986  -1.106        0.269622    
## Mjobother        -0.111991   0.146597  -0.764        0.445450    
## Mjobservices     -0.135048   0.166597  -0.811        0.418166    
## Mjobteacher      -0.066798   0.210155  -0.318        0.750799    
## Fjobhealth        0.066192   0.305988   0.216        0.828870    
## Fjobother         0.179164   0.224244   0.799        0.424887    
## Fjobservices      0.399310   0.231379   1.726        0.085325 .  
## Fjobteacher      -0.115857   0.271994  -0.426        0.670421    
## reasonhome        0.045740   0.113435   0.403        0.687044    
## reasonother       0.409977   0.164654   2.490        0.013270 *  
## reasonreputation  0.056384   0.117446   0.480        0.631488    
## guardianmother   -0.149224   0.108597  -1.374        0.170344    
## guardianother    -0.232964   0.235168  -0.991        0.322593    
## traveltime        0.152445   0.069143   2.205        0.028161 *  
## studytime        -0.115010   0.058181  -1.977        0.048903 *  
## schoolsupyes      0.021980   0.140610   0.156        0.875877    
## famsupyes        -0.057759   0.097938  -0.590        0.555763    
## activitiesyes    -0.185204   0.090385  -2.049        0.041249 *  
## nurseryyes       -0.261139   0.112122  -2.329        0.020461 *  
## higheryes         0.237373   0.241308   0.984        0.325989    
## internetyes       0.042568   0.131414   0.324        0.746201    
## romanticyes       0.052334   0.097849   0.535        0.593119    
## famrel           -0.178569   0.048777  -3.661        0.000293 ***
## freetime          0.069539   0.048266   1.441        0.150606    
## goout             0.293976   0.042174   6.971 0.0000000000173 ***
## health            0.056055   0.032523   1.724        0.085729 .  
## failures          0.169898   0.089496   1.898        0.058521 .  
## paidyes           0.287764   0.095692   3.007        0.002840 ** 
## absences          0.028011   0.008503   3.294        0.001094 ** 
## G1               -0.057193   0.037889  -1.509        0.132136    
## G2                0.042812   0.047726   0.897        0.370356    
## G3                0.006646   0.034650   0.192        0.848011    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8055 on 329 degrees of freedom
## Multiple R-squared:  0.4191, Adjusted R-squared:  0.3485 
## F-statistic: 5.935 on 40 and 329 DF,  p-value: < 0.00000000000000022
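As a side note, the formula built with paste() above can also be constructed with base R's reformulate(), which takes a character vector of predictor names directly (the names y, a, b, c below are hypothetical):

```r
# build "y ~ a + b + c" from a character vector of predictor names
preds <- c("a", "b", "c")
fmla <- reformulate(preds, response = "y")
fmla
# y ~ a + b + c
```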

From the regression model summary, I can read that male sex, the quality of family relationships and going out with friends are highly significant variables. The variables “paid” (extra paid classes in Math or Portuguese) and “absences” (number of school absences) are fairly significant. Moderately significant variables include study time, travel time, whether the student attended nursery school and whether the student has taken part in extracurricular activities.

In order to understand the relationship between high alcohol consumption and selected variables, I choose to visualize the relationship between high alcohol consumption and sex/gender, family relationships, going out and study time. For the analysis, I construct a subset comprising columns 2 (sex), 22 (family relationship), 24 (going out), 14 (study time) and 35 (high_use). The visualization is presented in sex-disaggregated form.

library(ggplot2)
library(GGally)
#construct subset comprising columns 2 (sex), 22 (family relationship), 24 (going out), 14 (study time) and 35 (high_use)
alcpairs <- alc[, c(2, 22,24,14,35)]
#plot the pairs
ggpairs(alcpairs, mapping = aes(col=sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

From the visual presentation, I can formulate the following working hypotheses for the selected variables:

  • sex: men are more likely than women to be high consumers
  • famrel: students with high alcohol consumption generally score lower on the quality of family relationships
  • goout: high alcohol consumption is strongly related to the frequency of going out, especially for men
  • study time: students using much alcohol generally study fewer hours

The table below shows a summary of key statistics for heavy and moderate drinkers (high_use = TRUE/FALSE), looking at four variables (three numeric, plus sex/gender).

alc %>% group_by(sex, high_use) %>% summarise(count = n(), famrel=mean(famrel), goout=mean(goout), studytime=mean(studytime))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 6
## # Groups:   sex [2]
##   sex   high_use count famrel goout studytime
##   <fct> <lgl>    <int>  <dbl> <dbl>     <dbl>
## 1 F     FALSE      154   3.92  2.95      2.34
## 2 F     TRUE        41   3.73  3.39      2   
## 3 M     FALSE      105   4.13  2.70      1.90
## 4 M     TRUE        70   3.79  3.93      1.63

The table above supports what has been stated previously: Portuguese young heavy drinkers generally go out more than their moderately drinking peers. They also spend less time studying, and they rate their family relations as worse.

To illustrate the relationship between family relations and heavy drinking, let’s draw a separate box plot. As the chart indicates, moderate consumers of alcohol have better family ties. The difference between groups is not, however, as big as the boxplot visualization would initially indicate. Therefore, red dots have been added to indicate mean values by group and sex.

The boxplot visualization indicates that students with high alcohol consumption score lower on the quality of family relationships, but the difference in mean values between groups is much smaller than a visualization of these integer-valued scores suggests: only 0.34 points for males and 0.19 for females. In this respect, the boxplot visualization is easily misleading.

NB! 1) In both groups there are outliers scoring very low on family relations. In general, however, respondents have scored fairly high (median = 4). NB! 2) The figure says little about whether the quality of family relations is a cause or a consequence of high alcohol consumption.

# initialise a plot of high_use and famrel
g <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))
g + geom_boxplot() + ylab("family relation") + ggtitle("Quality of family relationships \n by alcohol consumption and sex") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun=mean)
## Warning: Removed 4 rows containing missing values (geom_segment).

We can justify the claim that the visualization is misleading by adjusting the code for the previous table to also report median values. Doing so shows that there is no difference between groups or sexes in the median of family relations. The medians do reveal a difference in the “goout” variable, with heavy drinkers more inclined towards going out. (The median study time is the same across groups.)

alc %>% group_by(sex, high_use) %>% summarise(count = n(), famrel_mean=mean(famrel), famrel_median=median(famrel), goout_mean=mean(goout), goout_median=median(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 7
## # Groups:   sex [2]
##   sex   high_use count famrel_mean famrel_median goout_mean goout_median
##   <fct> <lgl>    <int>       <dbl>         <dbl>      <dbl>        <dbl>
## 1 F     FALSE      154        3.92             4       2.95            3
## 2 F     TRUE        41        3.73             4       3.39            4
## 3 M     FALSE      105        4.13             4       2.70            3
## 4 M     TRUE        70        3.79             4       3.93            4

Applying logistic regression to understand the relationship between high alcohol usage and select variables

In the logistic regression model, the computational target variable is the log of odds. From this it follows that applying the exponential function to the fitted values gives us the odds. That is, the exponent of a coefficient can be interpreted as the odds ratio associated with a unit change (vs. no change) in the corresponding explanatory variable (according to the course video on DataCamp).
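As a quick numeric check, using the sexM coefficient of 0.7959 reported in the model summary further below:

```r
# exponentiating a logit coefficient turns it into an odds ratio
exp(0.7959)   # approximately 2.22, as in the odds-ratio table below
```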

From this we see that, with an odds ratio of over 2, males have more than twice the odds of a “success” (that is, of being heavy drinkers) compared to their female peers, controlling for family relations, going out and study time. The same goes for each additional point on the going-out scale. On the other hand, students with good family relations and those spending much time studying are less likely to be heavy drinkers; their odds ratios are only about 2/3 per unit increase.

The confidence intervals in the two rightmost columns (2.5 % and 97.5 %) of the table below give an indication of the uncertainty of the estimates. For instance, the interval for sexM is wider than that for goout: although the odds ratios are about the same, the estimate for males is considerably less precise.

The data presented here largely supports my initial working hypotheses: that men are more likely than women to be high consumers, that students with high alcohol consumption generally score lower on the quality of family relationships, that high alcohol consumption is related to the frequency of going out, and that lower study time is related to higher alcohol consumption.

However, revisiting the hypotheses reveals a somewhat erroneous wording and approach, focusing too much on the faults of those I have called heavy drinkers (those with a positive high_use value). It would perhaps be fairer to comment on the relationship between high daily alcohol doses and background factors, and to avoid judgmental attitudes.

Below is the printout of the summary of the logistic regression model and the table with odds ratios and confidence intervals.

m <- glm(high_use ~ sex + famrel + goout + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + goout + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5891  -0.7629  -0.4976   0.8304   2.6937  
## 
## Coefficients:
##             Estimate Std. Error z value       Pr(>|z|)    
## (Intercept)  -1.2672     0.7312  -1.733        0.08309 .  
## sexM          0.7959     0.2669   2.982        0.00286 ** 
## famrel       -0.4193     0.1399  -2.996        0.00273 ** 
## goout         0.7873     0.1230   6.399 0.000000000157 ***
## studytime    -0.4814     0.1711  -2.814        0.00490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.69  on 365  degrees of freedom
## AIC: 380.69
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2816141 0.06545486 1.1622100
## sexM        2.2165443 1.31841735 3.7630180
## famrel      0.6575181 0.49791884 0.8636137
## goout       2.1974324 1.73873662 2.8198312
## studytime   0.6179355 0.43751933 0.8576353

Due to the variable selection made at the start of this exercise, all of the variables controlled for were found to be statistically significant, one at the 0.1 per cent level and three at the 1 per cent level. I can therefore explore the predictive power of the model as such.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to "alc"
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions. The numbers are the count of individuals.
table(high_use = alc$high_use, prediction = alc$prediction) 
##         prediction
## high_use FALSE TRUE
##    FALSE   230   29
##    TRUE     59   52
# tabulate the target variable versus the predictions. The table shows the proportions.
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.62162162 0.07837838 0.70000000
##    TRUE  0.15945946 0.14054054 0.30000000
##    Sum   0.78108108 0.21891892 1.00000000

Based on the data and the reported alcohol intake, exactly 30 per cent of respondents (111 of 370) were classified as heavy drinkers (high_use = TRUE). Equally, 70 per cent (259 individuals) were classified as non-heavy drinkers (high_use = FALSE). According to the model, 29 of 259 moderately drinking individuals (11 per cent) were erroneously predicted to be heavy drinkers. Of the 111 heavy drinkers, a majority (59) were erroneously predicted not to be high users, although they were according to the data.

Although the model did fairly well in recognizing non-heavy users, it did worse in identifying heavy drinkers among those who in fact were (based on the reported intake). Overall, the model predicted 78 per cent of respondents to be non-heavy users, while the actual proportion was 70 per cent.
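The counts in the table above can be turned into the standard classification metrics; a minimal sketch built directly from the cross-tabulation (rows: actual high_use, columns: prediction):

```r
# confusion matrix copied from the table above
cm <- matrix(c(230, 59, 29, 52), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy    <- (cm[1, 1] + cm[2, 2]) / sum(cm)  # share classified correctly
sensitivity <- cm[2, 2] / sum(cm[2, ])          # heavy drinkers identified
specificity <- cm[1, 1] / sum(cm[1, ])          # moderate drinkers identified
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.762, sensitivity 0.468, specificity 0.888
```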

The proportion of wrongly classified individuals is displayed visually in the plot below. As the visualization makes clear, the proportion of inaccurately classified individuals (i.e. the training error) is fairly high. The mean prediction error can be computed by defining a loss function and comparing classifications and probabilities. The calculation indicates that nearly 24 per cent are wrongly classified. I would not have expected the analysis to give such a high number. On the other hand, the model could become more accurate with a higher number of observations.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378

A note on the usefulness of the model

At a loss rate of 24 per cent, the model does not seem very accurate. Can I use the information? It depends on the purpose. If the goal were, for instance, to target drinkers with advertisements on social media (assuming that heavy drinkers would be more inclined to buy alcohol), the information could be useful, although targeting alcohol ads at minors would of course be ethically dubious. For identifying who will become an alcoholic in five years and who would need early intervention, the model is far too inaccurate. At this level of accuracy, one could perhaps only target information campaigns on the potential harms caused by drinking at an early age.

Bonus: Performing ten-fold cross-validation on the model

With ten-fold cross-validation, the prediction error seems to be larger than when using the loss function (with a difference of about one percentage point).

# Performing ten-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405
# calculating the difference between using ten-fold cross-validation and a loss function
cv$delta[1]-loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.002702703
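Worth noting: cv.glm() assigns observations to folds at random, so the K = 10 error estimate varies slightly between runs; fixing the seed makes it reproducible. A self-contained sketch on simulated data (x and y are hypothetical; with the course data, the same set.seed() call would precede cv.glm on model m):

```r
library(boot)
set.seed(123)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit <- glm(y ~ x, data = d, family = "binomial")
# same loss function as in the chunk above
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
set.seed(1); cv1 <- cv.glm(d, fit, cost = loss_func, K = 10)$delta[1]
set.seed(1); cv2 <- cv.glm(d, fit, cost = loss_func, K = 10)$delta[1]
identical(cv1, cv2)  # same seed gives the same folds and the same error estimate
```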

Super-Bonus: Performing cross-validation to compare the performance of different logistic regression models

Finally, I will perform cross-validation to compare the performance of logistic regression models using different sets of predictors. I start with the maximum number of predictors and explore the changes in the training and testing errors as I drop variables one by one. The original dataset contained 33 variables, but since high_use is directly derived from Walc and Dalc, I exclude these two (to avoid near-perfect correlation with the target).

In practice, I begin by creating a vector of running numbers in decreasing order from 31 to 1, indicating how many variables I will test at a time. I then create two more empty vectors, where I will save the results of the computational exercise. The three vectors will be used to construct the resulting data frame.

I then write a for loop to construct as many formulas as there are variables at any given time. I begin by running the cross-validation using the largest number of variables (31). The results are saved in a data frame along with the number of variables used. The resulting table will have three columns, indicating the number of variables, the prediction error rate and the training error rate, and 31 rows, one for every run.

Finally, I construct a plot indicating how both prediction and training errors decrease as the number of variables increases, with prediction errors always being greater. Looking at the resulting plot, I initially found it tough to digest the great variance, and especially the initially increasing trend in prediction errors. But I guess that might be a result of a varying number of statistically significant variables being used in the different models.

The downside of the fairly long code written below is the time it takes to execute. I have found that executing this last chunk alone - comparing different models with up to 31 variables - takes about three minutes. The time needed makes me reluctant to run the code to test any small changes, unless I have reduced the number of variables to a handful. If you have any suggestions on how to improve the code and speed up the process, I would greatly appreciate them!
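One likely reason for the slow execution is that cv.glm is run with K = nrow(alctest), i.e. leave-one-out cross-validation, which refits the model once per observation. A minimal sketch, using the built-in mtcars data as a stand-in for the alc data, of how lowering K reduces the number of refits:

```r
# comparing 10-fold CV with leave-one-out CV on a small toy model;
# cv.glm refits the model K times, so K = 10 is far cheaper than K = n
library(boot)

# same loss function as in the exercise: average number of wrong predictions
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

# toy logistic regression: transmission type explained by car weight
m <- glm(am ~ wt, data = mtcars, family = "binomial")

set.seed(1)
cv10  <- cv.glm(data = mtcars, cost = loss_func, glmfit = m, K = 10)           # 10 refits
cvloo <- cv.glm(data = mtcars, cost = loss_func, glmfit = m, K = nrow(mtcars)) # 32 refits

c(tenfold = cv10$delta[1], loo = cvloo$delta[1])
```

With the 370 rows of alc, this would cut each cross-validation from 370 model fits to 10, at the price of a somewhat noisier error estimate.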

library(dplyr)
library(boot)
howmanyvar <- 31 #Enter here how many variables you want to test for. 31 is the maximum. (33 variables were included in the original dataset but two of these will always be excluded for computational feasibility and avoidance of near-perfect correlation.)
#create vector, in sequence, starting from the number above, descending by 1.
v <- rev(seq(1,howmanyvar))
#create empty numeric vectors of the same length for the results.
trainingerrors <- numeric(howmanyvar)
predictionerrors <- numeric(howmanyvar)
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
for(i in v) {
#Within the for loop, I first create a temporary subset called "alctest". I choose to exclude the variables "Dalc" and "Walc", as these were found to cause warnings when fitting the logistic regression model and computing probabilities (they are highly correlated with "high_use"). I also exclude "probability" and "prediction", as the probability was calculated based on the whole dataset and will now be replaced.
exclvarnames <- names(alc) %in% c("Dalc","Walc", "alc_use", "probability", "prediction")
alctest <- alc[!exclvarnames]
#From the alctest subset I will gradually drop columns one at a time, starting from the number of variables entered in "howmanyvar". In alctest, 32 is the column index of high_use, which I always want to include.
alctest <- dplyr::select(alctest, 1:v[i], 32)
#However, I want to exclude "high_use" from the vector of column names used on the right-hand side of the formula.
fnames <- names(alctest)[names(alctest) !="high_use"]
# create a formula where "high_use" is explained by the remaining variables
f <- as.formula(paste("high_use~",paste(fnames,collapse="+")))
# run a logistic regression
m2 <- glm(f, data = alctest, family = "binomial")
# predict() the probability of high_use using model m2
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to "alctest"
alctest <- mutate(alctest, probability = probabilities)
# compute the average number of wrong predictions in the (training) data and save the result in vector
trainingerrors[i] <- loss_func(alctest$high_use,alctest$probability)
# K-fold cross-validation
cv <- cv.glm(data = alctest, cost = loss_func, glmfit = m2, K = nrow(alctest))
# compute the average number of wrong predictions in the cross validation and save the result in vector
predictionerrors[i] <- cv$delta[1]
}
results <- data.frame(variables=v, trainingerrors=trainingerrors, predictionerrors=predictionerrors)
results
##    variables trainingerrors predictionerrors
## 1         31      0.1918919        0.2567568
## 2         30      0.1891892        0.2621622
## 3         29      0.1945946        0.2540541
## 4         28      0.1972973        0.2540541
## 5         27      0.2162162        0.2783784
## 6         26      0.2108108        0.2621622
## 7         25      0.2162162        0.2702703
## 8         24      0.2162162        0.2729730
## 9         23      0.2675676        0.3162162
## 10        22      0.2594595        0.3270270
## 11        21      0.2729730        0.3216216
## 12        20      0.2675676        0.3297297
## 13        19      0.2702703        0.3297297
## 14        18      0.2783784        0.3270270
## 15        17      0.2810811        0.3162162
## 16        16      0.2783784        0.3162162
## 17        15      0.2810811        0.3081081
## 18        14      0.2810811        0.3081081
## 19        13      0.2729730        0.3135135
## 20        12      0.2783784        0.3162162
## 21        11      0.2729730        0.3324324
## 22        10      0.2918919        0.3189189
## 23         9      0.2864865        0.3108108
## 24         8      0.2918919        0.3162162
## 25         7      0.2945946        0.3054054
## 26         6      0.2945946        0.3054054
## 27         5      0.2918919        0.3027027
## 28         4      0.2972973        0.3135135
## 29         3      0.3054054        0.3135135
## 30         2      0.3000000        0.3000000
## 31         1      0.3000000        0.3000000
p <- ggplot(results, aes(x=variables)) + geom_line(aes(y=predictionerrors, color="prediction")) + geom_line(aes(y=trainingerrors, color="training"))
p + ggtitle("Relation between error rates\n and number of variables") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Number of variables") + ylab("Error rate") + scale_color_discrete(name="Type of error")


RStudio Exercise 4

date()
## [1] "Mon Dec 13 01:55:18 2021"

The Boston dataset - a brief description

The Boston dataset, included in the MASS package, contains 506 observations of 14 variables relating to housing in the Boston region. Each observation describes a Boston suburb or town. Variables include information such as crime rate, air pollution, ethnic composition, proportion of land zoned for large properties or industries, taxation, distances, highway access and pricing. As has been stated elsewhere, the dataset has become a much-used pedagogical tool for teaching regression analysis and machine learning. First, let us swiftly explore the contents!

# accessing the MASS package
library(MASS)
# loading the Boston data
data("Boston")
# typing ?Boston will open the documentation on the variables in the R console. A description can also be found at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.
?Boston
# a summary of the content of the variables is given below
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

It is perhaps hard to form an understanding of the relationships between the 14 variables simply by mapping pairs. Drawing a corrplot diagram allows us to get a quick overview of the correlations between variables. The larger the circle, the stronger the correlation; red indicates negative correlation, blue positive.

# plotting the relation between variables with corrplot
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# rounding values
cor_matrix<-cor(Boston) %>% round(2)
# drawing a corrplot
corrplot(cor_matrix, method="circle", type = "upper", tl.cex = 0.6, tl.pos = "d")

# contrary to the instructions provided on Data Camp, I choose to exclude the 'cl.pos = "b"' command, as I prefer to have the color legend vertically on the right rather than horizontally below the figure.

What sticks out from the visualization above is the strong negative correlation between the distance to employment centres (dis) and the proportion of old houses (age), nitrogen oxides concentration in the air (nox) and proportion of non-retail businesses (indus). That is, the further away we are from employment centres, the higher the proportion of new houses, the lower the share of industrial properties and the lower the concentration of nitrogen oxides in the air.

Similarly, there is a strong positive correlation between accessibility to highways (rad) and property-tax rate (tax). That is, the better the access to highways, the higher the property tax rate.

In the following section, I will standardize the variables, which allows me to compare them more easily and perform additional operations on them. (Ideally, I would explore the differences between the standardized and non-standardized variables graphically, for instance using the ggplot bar graph technique described here, but as this was not specifically demanded, I suspect it would be beyond the scope of this exercise.) For now, it will suffice to print out a summary of the variables in the standardized set.

Comparing the summaries shows that the range of certain variables has decreased significantly (e.g. crim, zn, indus). In fact, the maximum value of crim (9.92) is the largest value among all standardized variables. As all variables have now been centered around a mean of zero, the minimum values of all variables have become negative.

# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summarize the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
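As a quick sanity check of what scale() does here: each column is simply centered by its mean and divided by its standard deviation. A minimal sketch verifying this for the crim variable:

```r
# scale() standardizes each column: (x - mean(x)) / sd(x)
library(MASS)
data("Boston")

manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- as.data.frame(scale(Boston))$crim

all.equal(manual, scaled)  # TRUE
```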

As instructed, I continue by creating a categorical variable for the crime rate, using the quantiles as break points. I choose to name the categories from low to high. I then drop the old crime rate variable from the boston_scaled dataset.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Moreover, I divide the dataset into train and test sets so that 80% of the data belongs to the train set. This is done as a means of preparation for fitting a linear discriminant analysis model on the train set.

# dividing the dataset into train and test sets: I randomly assign 80 per cent of the rows of boston_scaled to the train set.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

Now that I have created the train and test sets, I continue by fitting the linear discriminant analysis on the train set. I choose the categorical crime rate (crime) as target variable and all the other variables as predictor variables. I will also plot the LDA fit. The chunk below includes a code for defining a function for enriching the plot with arrows.

# fitting the linear discriminant analysis on the train set using the categorical crime rate as target variable and all other variables in the dataset as predictor variables.
lda.fit <- lda(crime ~ ., data = train)
# this function can be used for adding arrows to the biplot that will be plotted next
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plotting the lda results with arrows added
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

The biplot above indicates the direction in which each variable pulls the results. The longest arrow is clearly that of the rad variable, with zn and nox also pulling the results fairly strongly in different directions.

Continuing to follow the given instructions, I save the crime categories from the test data before removing the crime variable. This allows me to evaluate the correctness of predictions when using the test data to predict crime classifications. The cross tabulation of predictions and correct results is given below.

The results of the cross tabulation are interesting, especially looking at the four corners. None of the areas where the crime rate was predicted low actually had high crime rates, and vice versa. Similarly, where crime rates were predicted high, they actually were high, and where they were predicted low, they chiefly were low. Some variation occurred nevertheless in how accurate the predictions were. For instance, of the areas with low rates only about half were correctly classified as low (most of the rest were classified as med_low, with one instance as med_high). It seems the model did a better job classifying areas with high crime rates than areas with low rates.

# saving crime categories from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predicting classes with the test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulating the results with the crime categories from the test set (removed from the test set)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      10        1    0
##   med_low    6       8       11    0
##   med_high   1       1       18    0
##   high       0       0        0   31

I will now continue to execute the last step of the exercise proper. I begin by reloading the Boston dataset and standardizing the variables. I will prepare a Euclidean distance matrix to calculate the distances between the observations, and then run a k-means algorithm on the dataset to let the computer sort and cluster the data. The clusters generated by the k-means algorithm will be plotted.

# reloading the Boston dataset
library(MASS)
data("Boston")
# standardizing the dataset
Boston <- scale(Boston)
# preparing a euclidean distance matrix
dist_eu <- dist(Boston)
# running a k-means algorithm on the dataset
km <-kmeans(Boston, centers = 3)
# plotting the results
pairs(Boston, col = km$cluster)

As is seen, the chart above looks really busy. We can determine the optimal number of clusters by plotting the total within-cluster sum of squares for k-means runs with 1 to 10 clusters. The result is given below.

The optimal number of clusters is indicated by the point where the curve drops sharply. In this case, there is no self-evident answer. In my interpretation, the drop is at its sharpest at two cluster centers, after which the decline slows down. I therefore find it reasonable to cluster using two centers. A new plot is printed below.

# calculating the total within sum of squares (up to 10 clusters)
set.seed(123) # set the random seed so the k-means results are reproducible
twcss <- sapply(1:10, function(k){kmeans(Boston, k)$tot.withinss})
# visualizing the results
library(ggplot2)
qplot(x = 1:10, y = twcss, geom = 'line') + scale_x_continuous(breaks = c(2,4,6,8,10), limits=c(1,10))

# running again the k-means algorithm on the dataset with the newly determined optimal number of clusters
km <-kmeans(Boston, centers = 2)
# plotting the results with pairs
pairs(Boston, col = km$cluster)

For the bonus section, I will run the k-means algorithm on the original (standardized) Boston data with 3 cluster centers. An LDA is performed with the clusters as target classes. The biplot with arrows is given below. It appears that zn, nox, medv and tax are the most influential linear separators for the clusters. That is, the proportion of residential land allocated for large properties, air pollution, price level and property tax rate seem to be the variables most strongly influencing which cluster a particular area belongs to.

# reloading the Boston dataset once more
library(MASS)
data("Boston")
# standardizing the dataset
Boston <- as.data.frame(scale(Boston))
# performing k-means on the dataset
km <-kmeans(Boston, centers = 3)
# saving clusters to be used with the LDA
clusters <- km$cluster
# adding the new clusters variable to the set
Boston <- data.frame(Boston, clusters)
# dividing the dataset into train and test sets to be used with the LDA
n <- nrow(Boston)
ind <- sample(n,  size = n * 0.8)
train2 <- Boston[ind,]
# removing chas because I repeatedly received the error message "variable 4 appears to be constant within groups"
train2 <- dplyr::select(train2, -chas)
# performing the lda
lda.fit2 <- lda(clusters ~ ., data = train2)
# defining the arrows function
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# plotting the lda results with arrows
plot(lda.fit2, dimen = 2, col=clusters, pch=clusters, ylim=c(-5,7),xlim=c(-5,5))
lda.arrows(lda.fit2, myscale = 4)

For the super bonus section, I apply the given code to create a matrix product, a projection of the data points, that will be visualized. I will draw two 3D plots: one where the color is defined by the categorical crime classes, and another where the color is defined by the k-means clusters. As can be seen, the shape of the two plots is identical. However, the colouring given by the categorical crime classes is much ‘neater’, aiding the eye in forming an understanding of which elements belong together. For instance, all yellow values are gathered at the left-hand side of the chart (with high x values). Turning and twisting the graphic with the mouse helps in understanding the data even better.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# accessing the plotly package in order to create a 3D plot of the columns of the matrix product. (I have already run install.packages("plotly") once.)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~classes)
# extracting the clusters from the second train set
cluscol <- train2$clusters
# drawing the second plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~cluscol)

RStudio Exercise 5

date()
## [1] "Mon Dec 13 01:55:31 2021"

The Human dataset - an overview

The human dataset contains country-specific indicators relating to the Human Development Index (HDI), provided by the United Nations Development Programme. The index measures achievements in dimensions of human development such as health and longevity, education and standard of living.

Variables included in this exercise indicate the ratio of women to men in secondary education and labour force, as well as expected length of education and life. Variables also include Gross National Income as a measure of standard of living, and maternal mortality and adolescent birth rate as measures of the level of health care provided to young women.

# reading the human data from my data folder
human <- read.csv("./data/human.csv", stringsAsFactors = F, row.names = 1)
# exploring the data structure
head(human)
##              EduRatio  LabRatio EducExp LifeExp   GNI Matmort AdolBirthRt
## Norway      1.0072389 0.8908297    17.5    81.6 64992       4         7.8
## Australia   0.9968288 0.8189415    20.2    82.4 42261       6        12.1
## Switzerland 0.9834369 0.8251001    15.8    83.0 56431       6         1.9
## Denmark     0.9886128 0.8840361    18.7    80.2 44025       5         5.1
## Netherlands 0.9690608 0.8286119    17.9    81.6 45435       6         6.2
## Germany     0.9927835 0.8072289    16.5    80.9 43919       7         3.8
##             ParlRep
## Norway         39.6
## Australia      30.5
## Switzerland    28.5
## Denmark        38.0
## Netherlands    36.9
## Germany        36.9
summary(human)
##     EduRatio         LabRatio         EducExp         LifeExp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Matmort        AdolBirthRt        ParlRep     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# plotting pairs
library(GGally)
ggpairs(human)

Below is a corrplot diagram of the correlations between variables. Based on the data, it is interesting to note that parliamentary representation does not seem to be particularly strongly correlated with anything. At least from the corrplot diagram, no strong correlation can be discerned (the share of females in parliament is only moderately correlated with the labour force ratio, expected length of education and life expectancy). Neither does the labour force ratio seem to be strongly correlated with anything.

# drawing a corrplot
library(corrplot)
cor(human) %>% corrplot()

Performing principal component analysis (PCA)

First, I will perform a principal component analysis (PCA) on the non-standardized human data, as instructed in the exercise. The variability captured by the principal components is given in the printout of values below. The plot highlights the impact of the Gross National Income (GNI) variable as this was the one variable containing the largest absolute numbers (maximum values over 100 times larger than in any other variable).

# perform principal component analysis (with the SVD method) on the human dataset in its *non-standardized* form
pca_human <- prcomp(human)
# exploring the variability captured by the principal components
summary(pca_human)$importance[2,]
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 0.9999 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.9), col = c("grey40", "deeppink2"), main = "Impact of GNI on HDI")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Now, I will redo the PCA analysis with standardized variables.

# standardize variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method) on the human dataset in its *standardized* form
pca_human_std <- prcomp(human_std)
# exploring the variability captured by the principal components
s <- summary(pca_human_std)$importance[2,]
s
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298
# save rounded percentages of variance captured by each PC (to be used as axis labels)
pca_pr <- round(s*100, digits = 1)
# create object pc_lab (to be used as axis labels)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.9), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "Interconnected variables impacting HDI")

As was seen above, the first plot drawn from the non-standardized version was extremely hard to read, as almost all results in the scatter plot were pushed into the upper right corner into a really crowded space. Because of the high absolute values in the GNI variable, that is about the only arrow whose direction is visible in the first plot; I can’t even tell if there are any arrows pointing in any other direction (I suppose the arrows of ParlRep and LabRatio are pointing upwards, but they are at least not properly visible).

According to the UNDP, the Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and having a decent standard of living.

From the latter plot we can discern three groups of variables that seem to influence the HDI value in different ways. Life expectancy, maternal mortality and adolescent birth rate are all related to living a long and healthy life. Life expectancy is negatively correlated with maternal mortality and adolescent birth rate, and so the arrows here pull the results in opposite directions.

The proportion of women in labour force and parliament (visible as arrows pointing upwards) can be interpreted as variables measuring women’s possibilities to participate in and impact society. The proportion of women in parliament is moderately correlated with the proportion of women in the labour force.

Interestingly, the ratio of women to men in secondary education is more strongly related to life expectancy than to the proportion of women in parliament or the ratio of women to men in the labour force. Being knowledgeable seems to be more strongly correlated with being able to live a long and healthy life than the ability to participate in working life and national politics.

Balancing national income and gender equality - interpreting the first two principal component dimensions of the PCA

The scatter plot produced by the PCA is a very interesting one. The GNI variable that stood out in the first plot is now almost entirely hidden behind the EduRatio variable (and thus very strongly correlated with that variable). In a way, this is the key to understanding the horizontal x axis of the plot. On the left-hand side, we find rich countries, European, Asian and Arab alike. On the right-hand side, we find poorer countries, many of which have been hit by war and poverty and lack good public health services.

The vertical y axis is also an interesting one, although initially harder to label. I speculated over whether it was a conservative-liberal axis or a progressive-reactionary axis, before concluding that the y axis is all about gender equality. Let me explain why.

The angle between a variable and a PC axis can be interpreted as the correlation between the two. In the case of the x axis, the difference in angle is minimal between the variables at each horizontal end. As the angles between the ParlRep and LabRatio variables and the y axis are almost as small (although noticeably larger), one could understand the y axis as an axis reflecting equality between women and men (in politics and the workforce), placing Rwanda at the top and hard-line Arab states at the bottom. (Rwanda’s top placement might come as a surprise to some, but the power balance between men and women in Rwanda, and the attitudes and approaches chosen by Rwandan women, have been elaborated on in some media articles, for example here.)
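This angle-as-correlation reading can also be checked numerically: for a PCA on standardized data, the correlation between each original variable and the component scores corresponds to what the biplot arrows show. A minimal sketch using the built-in iris measurements as a stand-in for the human data:

```r
# PCA on standardized data, then correlate the original (scaled)
# variables with the scores of the first two principal components
x <- scale(iris[, 1:4])
pca <- prcomp(x)

# correlations near +/-1 correspond to arrows almost parallel to a PC axis;
# correlations near 0 correspond to arrows at right angles to it
round(cor(x, pca$x[, 1:2]), 2)
```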

In principal component analysis, the first principal component captures the maximum amount of variance from the features of the original data. In this instance, PC1 (relating to life expectancy and health issues) explains just over half of the variance. The gender equality aspects relating to female representation in the labour force and national politics account for only about a sixth of the variance between countries. (Note how one single variable, GNI, explained over 99 per cent of the variance in the erroneous first plot.)

A note on the side: what the plots - and the data - have not said anything about is the distribution of income and wealth within countries. Attempts to quantify such inequality have at various times been made using, for instance, the Gini coefficient, although this measure has shortcomings of its own (it does not, for instance, take into account the effect of income redistribution, differences in living costs between countries, or the distribution of wealth).
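For illustration, the Gini coefficient can be computed directly from its definition, the mean absolute difference between all pairs of incomes divided by twice the mean. A small sketch with a made-up income vector:

```r
# Gini coefficient from its definition; the income vector is made up.
gini <- function(x) {
  n <- length(x)
  # sum of absolute differences over all ordered pairs, normalised
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}
gini(c(10, 20, 30, 40, 100))  # → 0.4 (0 = perfect equality, 1 = maximal inequality)
gini(rep(25, 4))              # → 0, everyone earns the same
```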

Exploring tea drinking habits

The tea dataset of the FactoMineR package includes 300 observations of 36 variables relating to tea drinking habits. Most variables are factors with 2 levels (i.e. binary), but some have more levels. Initially, I will visualise the data by drawing bar plots of six of the most interesting variables. I will then conduct a Multiple Correspondence Analysis on the tea data, offer an interpretation of the results of the MCA and draw a variable biplot.

# loading the tea dataset
library(FactoMineR) # I installed FactoMineR before writing this code
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
data("tea")
# exploring the tea dataset
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
?tea
dim(tea)
## [1] 300  36
# moving on to visualization. I choose to visualize six of the variables that I find most interesting.
keep_columns <- c("age_Q","frequency","How","price","SPC","Tea")
tea_visu <- select(tea, one_of(keep_columns))
gather(tea_visu) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

# conducting the Multiple Correspondence Analysis. I continue with my selection of six interesting variables.
mca <- MCA(tea_visu, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_visu, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.350   0.292   0.234   0.214   0.206   0.198   0.189
## % of var.              9.127   7.610   6.092   5.588   5.386   5.172   4.926
## Cumulative % of var.   9.127  16.737  22.830  28.417  33.803  38.975  43.901
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.181   0.180   0.178   0.170   0.166   0.152   0.149
## % of var.              4.730   4.705   4.636   4.431   4.334   3.977   3.897
## Cumulative % of var.  48.631  53.336  57.972  62.403  66.737  70.714  74.611
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.145   0.140   0.134   0.123   0.113   0.109   0.108
## % of var.              3.789   3.663   3.484   3.222   2.960   2.849   2.813
## Cumulative % of var.  78.400  82.064  85.547  88.769  91.729  94.577  97.390
##                       Dim.22  Dim.23
## Variance               0.055   0.046
## % of var.              1.423   1.188
## Cumulative % of var.  98.812 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           |  0.239  0.054  0.008 |  1.146  1.501  0.184 |  0.579  0.478
## 2           |  0.461  0.202  0.060 |  0.798  0.727  0.181 |  0.313  0.140
## 3           |  0.025  0.001  0.000 |  0.459  0.241  0.057 |  0.211  0.063
## 4           | -0.847  0.684  0.411 | -0.364  0.151  0.076 |  0.155  0.034
## 5           | -0.125  0.015  0.008 |  0.193  0.042  0.018 | -0.049  0.003
## 6           | -0.973  0.901  0.257 | -0.391  0.175  0.042 |  0.122  0.021
## 7           |  0.055  0.003  0.001 |  0.531  0.322  0.069 |  0.818  0.955
## 8           |  0.401  0.153  0.035 |  0.702  0.563  0.108 |  1.012  1.462
## 9           |  0.454  0.196  0.051 |  0.693  0.549  0.118 |  0.767  0.839
## 10          |  0.680  0.441  0.117 |  0.520  0.309  0.069 |  0.911  1.185
##               cos2  
## 1            0.047 |
## 2            0.028 |
## 3            0.012 |
## 4            0.014 |
## 5            0.001 |
## 6            0.004 |
## 7            0.163 |
## 8            0.225 |
## 9            0.144 |
## 10           0.210 |
## 
## Categories (the 10 first)
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test
## 15-24       |  -1.028  15.424   0.467 -11.817 |  -0.722   9.133   0.231  -8.303
## 25-34       |  -0.194   0.412   0.011  -1.834 |   0.556   4.059   0.092   5.252
## 35-44       |   0.438   1.220   0.030   2.972 |   0.967   7.123   0.144   6.558
## 45-59       |   0.410   1.627   0.043   3.580 |   0.732   6.227   0.137   6.396
## +60         |   1.721  17.867   0.429  11.332 |  -1.454  15.306   0.307  -9.577
## 1/day       |  -0.080   0.098   0.003  -0.947 |   0.450   3.670   0.094   5.302
## 1 to 2/week |  -0.320   0.717   0.018  -2.297 |  -0.096   0.077   0.002  -0.686
## +2/day      |   0.149   0.446   0.016   2.204 |  -0.262   1.658   0.050  -3.879
## 3 to 6/week |   0.084   0.038   0.001   0.519 |  -0.157   0.159   0.003  -0.969
## alone       |  -0.217   1.454   0.087  -5.107 |  -0.080   0.237   0.012  -1.882
##                 Dim.3     ctr    cos2  v.test  
## 15-24       |   0.285   1.775   0.036   3.275 |
## 25-34       |  -0.740   8.994   0.164  -6.995 |
## 35-44       |   0.865   7.127   0.115   5.870 |
## 45-59       |   0.058   0.048   0.001   0.502 |
## +60         |  -0.349   1.099   0.018  -2.296 |
## 1/day       |  -0.644   9.372   0.192  -7.580 |
## 1 to 2/week |   0.576   3.468   0.057   4.127 |
## +2/day      |   0.175   0.926   0.023   2.594 |
## 3 to 6/week |   0.400   1.297   0.020   2.475 |
## alone       |  -0.012   0.006   0.000  -0.271 |
## 
## Categorical variables (eta2)
##               Dim.1 Dim.2 Dim.3  
## age_Q       | 0.767 0.732 0.267 |
## frequency   | 0.027 0.097 0.211 |
## How         | 0.156 0.074 0.139 |
## price       | 0.162 0.097 0.207 |
## SPC         | 0.677 0.745 0.327 |
## Tea         | 0.310 0.005 0.250 |
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

As the distance between variable categories in an MCA provides a measure of their similarity, this gives me some hints about how to interpret the tea drinking habits of different population groups. The pink colour indicates occupation, black indicates age, blue indicates tea brand, red indicates tea drinking frequency, brown indicates the kind of tea (green, black or other) and green indicates the way of drinking tea (e.g. with milk or lemon).

Heavily stereotyping, one could suggest that in the bottom-right corner we find non-workers and seniors who might have an occasional cup of some cheap label every now and then. In the bottom-left corner, we find young students drinking private-label teas, some of whom drink tea fairly often, even twice a day. In the top-left corner, we find workmen and employees having an occasional cup once a week or a regular cup once a day of some unknown or variable brand. In the top-right corner, we find the more posh middle- and upper-class drinkers who do not necessarily drink tea that regularly (perhaps they prefer coffee), but who are presumably picky about their drink (the branded and upscale tea assortments are found here). Also, the closer we get to the middle and senior occupational groups, the more green tea is consumed.

It’s worth noting that the two dimensions plotted in the MCA factor map do not account for much of the variation - in fact, less than ten per cent each. In other words, the remaining dimensions jointly explain much more of the variation in tea drinking habits.


RStudio Exercise 6

date()
## [1] "Mon Dec 13 01:55:42 2021"

In Exercise 6 we were asked to conduct analyses on the RATS and BPRS datasets following the techniques described in the course book, Vehkalahti and Everitt’s (2019) Multivariate Analysis for the Behavioral Sciences. The trick was to swap datasets, implementing the analyses of Chapter 8 using the RATS data and Chapter 9 using the BPRS data.

Exploring the RATS dataset

The RATS dataset contains data from a nutrition study conducted on three groups of rats (Crowder and Hand, 1990). Vehkalahti and Everitt (2019: 174) summarize the study design: “the three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.”

I begin by reading the data, converting categorical variables to factors, and confirming the structure. (I considered factorizing the time variable, but refrained from doing so as time is essentially continuous by nature.) Let’s also take a look at what the data currently looks like.

# Reading the data. (I have already transformed the data into long form using the wrangling script.)
RATSL <- read.csv("./data/ratsl.csv", stringsAsFactors = T)
# Converting categorical variables to factors
RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)
# Exploring the structure of the dataset
str(RATSL)
## 'data.frame':    176 obs. of  4 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
# Viewing the data
head(RATSL)
##   ID Group Time Weight
## 1  1     1    1    240
## 2  2     1    1    225
## 3  3     1    1    245
## 4  4     1    1    260
## 5  5     1    1    255
## 6  6     1    1    260

On the DataCamp course site, the authors note that graphical displays of data are almost always useful for exposing patterns, particularly unexpected ones. Let’s begin by plotting the weights of all sixteen rats, differentiating between the groups into which they were divided.

library(dplyr); library(tidyr); library(ggplot2)
# Plotting rats
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

The graphical overview makes it clear that group 1 consists of small rats (weight < 300 g). Groups 2 and 3 seem to consist chiefly of mid-size and large rats (about 400-500 g and 500-600 g, respectively). However, the division is not clear-cut (some overlap is visible). It is not entirely clear to me on what basis the rats were grouped.

The graphical presentation above has the advantage of a crystal-clear grouping and the possibility of following individual trajectories, at least in groups 2 and 3. However, one might get the impression that the graphs waste visual space. An alternative way of presenting the information is given in Vehkalahti and Everitt (2019: 177), offering a quick overview of the rats’ development. A downside of this is that groups 2 and 3 are hard to tell apart, as the linetypes are so similar. An attempt to present all groups in the same panel, differentiating between groups using colours, is given below. After this I will continue with the model above.

ggplot(RATSL, aes(x = Time, y = Weight, group = interaction(Group, ID), colour = Group, linetype = Group)) +
  geom_line() +
  theme(legend.position = "top") +
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight))) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20))

Let’s explore the data further by doing a few exercises, standardising the Weight variable and plotting again with the standardised variable.

# Standardising the Weight variable and adding a column
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdWeight = ((Weight-mean(Weight))/sd(Weight))) %>%
  ungroup()
# Glimpsing the data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
# Plotting again with the standardised Weight variable
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(name = "standardized Weight", limits = c(min(RATSL$stdWeight), max(RATSL$stdWeight)))

The graph drawn with standardized weights shows quite clearly how the rats’ weights are placed relative to the mean, but also the relative development of individuals and groups. In group 1, the development seems to have been fairly even; the change is hardly visible. In group 2, individuals move in opposite directions. In group 3, weights seem roughly steady or declining.

Caution should be taken in drawing far-reaching conclusions from this data, especially considering the small number of individuals included in groups 2 and 3.

Let’s move on to summary graphs.

# Summary data with mean and standard error of RATS by treatment and week 
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = (sd(Weight)/sqrt(length(unique(RATSL$ID))))) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Glimpsing the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se    <dbl> 3.805394, 3.273268, 2.868977, 3.400204, 2.764370, 2.945942, 2.73…
# Plotting the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.4)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

Basically, the graph above shows that the variance in groups 1 and 3 is very small, and only slightly bigger in group 2. The graph also suggests, roughly, that weight rises faster in group 2 than in groups 1 and 3.

We can also create boxplots to spot outliers and to identify thresholds for filtering them out.

# Create a summary data by Group and ID with mean as the summary variable (ignoring baseline Time 1).
RATSL8S <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Glimpsing the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# Drawing a boxplot of mean versus Group
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), 64 days")

If we are to follow the recipe of the DataCamp course and the MABS book precisely, my next step would be to filter out the outlier values and plot the data again. I can do so by cutting off values below 250 grams and above 550 grams. However, getting rid of the outlier in group 3 is trickier - I have to filter out rats under 500 grams in that group only.

Again, I want to stress the small sample size, since it poses considerable problems for drawing conclusions. I would even be willing to raise some questions about the design of the original study. Should, for instance, groups 2 and 3 have been combined, since the rats in these groups are considerably larger than those in group 1, and since the two groups taken together would be as large as group 1? Or should the groups have been mixed, so that equal numbers of small and large rats were given each diet?

# Filtering out outliers (rats weighing on average more than 550 or less than 250 grams)
RATSL8S1 <- RATSL8S %>% filter(RATSL8S$mean<550 & RATSL8S$mean > 250)
# Filtering out the last outlier (the rat in group 3 weighing less than 500 grams)
RATSL8S1 <- RATSL8S1[-which(RATSL8S1$Group == 3 & RATSL8S1$mean < 500),]
# Inspecting the summary reveals that one outlier has been eliminated in every group
summary(RATSL8S1)
##  Group       ID         mean      
##  1:7   1      :1   Min.   :261.7  
##  2:3   3      :1   1st Qu.:267.5  
##  3:3   4      :1   Median :276.2  
##        5      :1   Mean   :373.3  
##        6      :1   3rd Qu.:457.5  
##        7      :1   Max.   :542.2  
##        (Other):7
# Plotting the data again without outliers
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), 64 days")

# Inspecting summaries by group
RATSL8S1 %>% 
  group_by(Group) %>% 
  summarize(mean = mean(mean))
## # A tibble: 3 × 2
##   Group  mean
##   <fct> <dbl>
## 1 1      269.
## 2 2      452.
## 3 3      538.

The plot drawn without outliers largely confirms what we knew from before: group 1 consists of small rats, group 2 of mid-size rats and group 3 of large rats, weighing 269, 452 and 538 grams on average. The variance within groups is very small, partly due to the small sample size and the subsequent filtering of outliers.

Let’s do some calculations. Since I have three groups, I will not perform simple t-tests between the groups, as this could lead to an accumulation of type I errors. Instead, I’ll rely on ANOVA for testing.
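Had I wanted pairwise comparisons anyway, base R offers pairwise.t.test() with multiplicity correction, which addresses the same type I error concern. A self-contained sketch with simulated group means (the data are made up, not the RATS values):

```r
# Bonferroni-corrected pairwise t-tests; the three groups are simulated.
set.seed(42)
grp <- factor(rep(1:3, each = 6))
y <- c(rnorm(6, 270, 10), rnorm(6, 450, 10), rnorm(6, 540, 10))
# adjusted p-values guard against inflated type I error across comparisons
pairwise.t.test(y, grp, p.adjust.method = "bonferroni")
```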

The analysis confirms that the mean weight over the study period is strongly related to the baseline weight recorded on day 1 (p < 0.001). There is weak evidence of a difference between the diets (p = 0.076 for the Group term). Printing a regression summary of the fit shows that neither the Group 2 diet (p = 0.089) nor the Group 3 diet (p = 0.329) has a statistically significant impact on the rats’ weights at the 5 per cent level.

# Adding the baseline. According to the instructions on the Data camp course site, 
# we should be using the original data as the source for the new variable. 
# Since I never imported the original data, I now have to extract the baseline from 
# the imported data. (I did confirm that the extracted vector corresponds to that from the original data.)
RATSL8S2 <- RATSL8S %>%
  mutate(baseline = RATSL[RATSL$Time == 1,]$Weight)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)

# A command used to prompt the use of zeros instead of scientific notation for p values
options(scipen=999)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value             Pr(>F)    
## baseline   1 253625  253625 1859.8201 0.0000000000000157 ***
## Group      2    879     439    3.2219            0.07586 .  
## Residuals 12   1636     136                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Print a summary of the regression model for comparison.
summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.905  -4.194   2.190   7.577  14.800 
## 
## Coefficients:
##             Estimate Std. Error t value    Pr(>|t|)    
## (Intercept) 33.16375   21.87657   1.516      0.1554    
## baseline     0.92513    0.08572  10.793 0.000000156 ***
## Group2      34.85753   18.82308   1.852      0.0888 .  
## Group3      23.67526   23.25324   1.018      0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.992 
## F-statistic: 622.1 on 3 and 12 DF,  p-value: 0.0000000000001989

The confidence intervals describe the model coefficients rather than individual rats. At the 95 per cent level, the baseline-adjusted mean weight of Group 2 is estimated to lie between roughly 6 grams below and 76 grams above that of Group 1, and that of Group 3 between roughly 27 grams below and 74 grams above it. As both intervals include zero, neither diet can be said to differ significantly from the Group 1 diet.

# The confidence intervals (at 95 %) are given below.
confint(fit)
##                   2.5 %    97.5 %
## (Intercept) -14.5011937 80.828690
## baseline      0.7383656  1.111899
## Group2       -6.1544447 75.869498
## Group3      -26.9892106 74.339724
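As a sanity check, these intervals can be reproduced by hand from the regression summary as estimate ± t × standard error, with the t quantile taken at 12 residual degrees of freedom:

```r
# Recomputing the 95 % CI for the Group2 coefficient by hand
# (estimate and SE copied from the regression summary above).
est <- 34.85753
se <- 18.82308
est + c(-1, 1) * qt(0.975, df = 12) * se  # ≈ -6.154 and 75.869, as in confint(fit)
```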

Exploring the BPRS dataset

In the second part of the exercise we will be exploring the BPRS dataset using the techniques employed in Chapter 9 of the course book. Vehkalahti and Everitt (2019: 157) summarize the psychiatric study that provided the data for the dataset: “40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.”

Again, we will begin by reading the data, converting categorical variables to factors, exploring the structure of the data and plotting the data in order to explore patterns.

# Reading the data
BPRSL <- read.csv("./data/bprsl.csv", stringsAsFactors = T)
# Converting categorical variables to factors
BPRSL$treatment <- as.factor(BPRSL$treatment)
BPRSL$subject <- as.factor(BPRSL$subject)
# Confirming the structure of the dataset
str(BPRSL)
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
# Viewing the data
head(BPRSL)
##   treatment subject week bprs
## 1         1       1    0   42
## 2         1       2    0   58
## 3         1       3    0   54
## 4         1       4    0   55
## 5         1       5    0   72
## 6         1       6    0   48
# Plotting the BPRSL data
ggplot(BPRSL, aes(x = week, y = bprs, group = interaction(treatment, subject), colour = treatment)) + 
  geom_line() + 
  scale_x_continuous(name = "Weeks") + 
  scale_y_continuous(name = "BPRS score") + 
  theme(legend.position = c(0.85,0.85)) + 
  ggtitle("Impact of treatment on BPRS") + 
  theme(plot.title = element_text(hjust = 0.5))

From the plot above, it appears as if both treatments have a positive impact, generally lowering BPRS scores. However, simply by casting an eye on the chart it is hard to tell which treatment does a better job.

As a means of initial exploration, we can fit a linear regression model to try to understand the difference between the two treatments.

# creating a regression model
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRSL)
# printing out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982 <0.0000000000000002 ***
## week         -2.2704     0.2524  -8.995 <0.0000000000000002 ***
## treatment2    0.5722     1.3034   0.439               0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 0.00000000000000022

From the summary of the regression model we can see that time (the variable week) has a clearly significant effect: the more weeks that pass, the lower the BPRS score. However, we cannot find any difference between the treatments, at least not with this model, which naively treats all 360 observations as independent and thus ignores the repeated-measures structure of the data.
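To illustrate why this matters, here is a minimal simulated sketch of my own (hypothetical data and variable names, not the BPRS dataset): when repeated measurements share a subject effect, the residuals of an ordinary regression are correlated within subjects rather than independent.

```r
# A simulated sketch (hypothetical data, not BPRS): repeated measurements
# share a per-subject effect, so ordinary regression residuals from the
# same subject are correlated rather than independent.
set.seed(1)
n_subj <- 20
subj_effect <- rnorm(n_subj, sd = 7)          # per-subject baseline shift
d <- expand.grid(subject = 1:n_subj, week = 0:8)
d$y <- 50 - 2 * d$week + subj_effect[d$subject] + rnorm(nrow(d), sd = 5)
fit <- lm(y ~ week, data = d)
# Under independence the subject means of the residuals would hover near
# zero; here they inherit the subject effect and spread widely.
res_means <- tapply(resid(fit), d$subject, mean)
round(sd(res_means), 2)
```

The spread of these per-subject residual means is exactly what the random intercept of a mixed model is designed to absorb.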

Let’s move on to fit a random intercept model, which will allow the linear regression fit for each individual to differ in intercept from other individuals. We use the same explanatory variables: time (week) and treatment.

We will use the lme4 package, as instructed. The model formula follows the standard linear regression syntax, with the addition of random-effects terms distinguished by vertical bars (|).

# accessing the lme4 library
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# creating a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Printing the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

The printout contains a number of interesting figures. The one I would most like to draw attention to is the between-subject variability, expressed by the random intercept standard deviation of 6.885 points.
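As a side note of my own (not required by the exercise), the two variance components in the table can be combined into the intraclass correlation, i.e. the share of the total variance that is attributable to differences between subjects:

```r
# Intraclass correlation from the variance components reported above:
# the subjects account for roughly a third of the total variance.
var_subject  <- 47.41    # variance of the random intercept
var_residual <- 104.21   # residual variance
icc <- var_subject / (var_subject + var_residual)
round(icc, 3)
## [1] 0.313
```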

Next, we can fit a random intercept and random slope model to the BPRS data. This allows the linear regression fit for each individual to differ both in intercept and in slope, making it possible to account for individual differences in the subjects' trajectories as well as the effect of time.

# creating a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# printing a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the printout of the random intercept and random slope model, the standard deviation of the individual intercepts has grown to over 8. The ANOVA test comparing the two models suggests that the random intercept and random slope model (BPRS_ref1) fits the data better than the plain random intercept model (χ² = 7.27, df = 2, p = 0.026). Thus, allowing the slopes to vary between individuals does seem worthwhile.
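As a sanity check of my own, the likelihood-ratio test in the ANOVA table can be reproduced by hand from the reported deviances and parameter counts (the deviances are rounded to one decimal in the printout, so the statistic differs slightly from the printed 7.2721):

```r
# Likelihood-ratio test by hand, using the (rounded) deviances above
chisq <- 2738.7 - 2731.4   # deviance(BPRS_ref) - deviance(BPRS_ref1)
df    <- 7 - 5             # extra parameters in the random slope model
p     <- pchisq(chisq, df = df, lower.tail = FALSE)
round(c(chisq = chisq, df = df, p = p), 3)
## chisq    df     p 
## 7.300 2.000 0.026
```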

We can now fit a random intercept and slope model that allows for interaction between time and treatment.

# creating a random intercept and random slope model with an interaction term
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + week*treatment, data = BPRSL, REML = FALSE)
# printing out a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
# performing an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + week * treatment
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the summary, the interaction term does not appear to have a significant effect on the model as a whole (t value 1.785), and the ANOVA comparison agrees: adding the interaction does not significantly improve the fit (p = 0.075). Neither does treatment itself have any statistically significant impact (t value -1.2).
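Since lmer deliberately omits p-values for the fixed effects, a rough two-sided p-value for the interaction term can be approximated from its t value using the normal distribution (an approximation of my own, reasonable here given the number of observations):

```r
# Rough two-sided p-value for week:treatment2 via the normal approximation
# to the t value reported in the summary above
t_val <- 1.785
p_approx <- 2 * pnorm(-abs(t_val))
round(p_approx, 3)
## [1] 0.074
```

The result agrees well with the p-value of 0.075 from the likelihood-ratio test.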

As a final step, we can replot the original observed values and study how they compare with the fitted values of the model. As it appears, the trend visible in the observed values is even more clearly visible in the fitted values. I would conclude that time spent in care clearly has an effect on the brief psychiatric rating scale scores measured among the 40 men here. Whether the decline is due to the treatment is unclear. From the data analysed here, it cannot be concluded that one treatment is better than the other, or that the treatment is effective in the first place. To advance such a claim, we would have needed a control group receiving no treatment at all.

# Creating a vector of the fitted values
fitted <- fitted(BPRS_ref2)
# Adding the fitted values as a new column (mutate() comes from dplyr)
BPRSL <- mutate(BPRSL, fitted=fitted)
# Reprinting the plot with observed BPRS values (for the sake of comparison)
ggplot(BPRSL, aes(x = week, y = bprs, group = interaction(treatment, subject), colour = treatment)) + 
  geom_line() +
  scale_x_continuous(name = "Time (weeks)") +
  scale_y_continuous(name = "BPRS score") +
  theme(legend.position = c(0.85,0.85)) + 
  ggtitle("Observed BPRS values") + 
  theme(plot.title = element_text(hjust = 0.5))

# plotting the fitted values
ggplot(BPRSL, aes(x = week, y = fitted, group = interaction(treatment, subject), colour = treatment)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)") +
  scale_y_continuous(name = "BPRS score") +
  theme(legend.position = c(0.85,0.85)) + 
  ggtitle("Fitted BPRS values") + 
  theme(plot.title = element_text(hjust = 0.5))